A few segments have significant gene enrichments:
| Community | Term |
|---|---|
| II | acetyl-CoA transmembrane transporter activity |
| IV-A | Factor: Gal4p, Factor: Gal4p, Factor: SWI4P |
| VII-B | Factor: DAL81 |
| X | Factor: Mbp1p, Factor: ARG |
Statistic: average pairwise histone-space Euclidean distance.
Only the larger clusters produce enrichments:
| Term | PAdj | |
|---|---|---|
| D | Terpenoid backbone biosynthesis | 4% |
| E | Ribosomal small subunit binding | 5% |
| F | Factor: HSF; motif: AGAANAGAANAGAAN | 2% |
| G | Factor: HSF; motif: NTTCTAGAANAGAAN | 1% |
Statistic: Average pairwise Jaccard distance in motif presence / absence in a sphere. Double-tailed p-value. So these are both spheres of lower and greater than normal Jaccard distance.
Cluster A produces a rich g:profiler result because of 4 “special” "genes it contains: YLR160C, YLR158C, YLR155C, YLR157C.
| Term | PAdj | TF motif enrichment | |
|---|---|---|---|
| A | Cyanoamino acid metabolism | 0.001% | |
| B | ATPase-coupled transmembrane transporter activity | 2% | Factor: MAC1; motif: TTTGCTCAM |
| C | Regulation of attachment of spindle microtubules to kinetochore | 0.9% | |
| D | Factor: GCR2; motif: GCTTCCN |
Coexpression is highly correlated with proximity. Roughly half the genome exhibits strong (PAdj < 1%) local coexpression significance.
There appear some enrichments in the larger clusters, but these are too generic (“intracellular anatomical structure” and such).
These seem to be randomly distributed in 3D space:
We used the haploid genome model by Duan et al. to extract positions of all Sacharomyces cerevisiae genes.
Not a static configuration. Position implies “preferred neighborhood”.
We have observed and annotated three rough regions of the genome:
Centroid of the “bubble”: 108.823, 71.102, 81.404
SELECT Gene, sqrt((x - 108.823) * (x - 108.823) + (y - 71.102) * (y - 71.102) + (z - 81.404) * (z - 81.404)) AS Radioid FROM Loci WHERE z < 175
It turns out that the bubble is hollow:
Bubble radioids
Average distance from the centroid is: 56.148
CREATE TABLE Labels AS
SELECT Gene,
CASE
WHEN z > 175 THEN "Hat"
ELSE CASE
WHEN (x - 108.823) * (x - 108.823) + (y - 71.102) * (y - 71.102) + (z - 81.404) * (z - 81.404) > 3152.644 THEN "External"
ELSE "Internal"
END
END Label
FROM Loci
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
genePositionsLabels <- dbGetQuery(conn, "SELECT Loci.Gene, x, y, z, Label FROM
Loci JOIN RoughFeatures ON Loci.Gene = RoughFeatures.Gene")
plot_ly(x=genePositionsLabels$x, y=genePositionsLabels$y, z=genePositionsLabels$z, type="scatter3d", mode="markers", text=genePositionsLabels$Gene, color=genePositionsLabels$Label)
dbDisconnect(conn)
Histone modifications are agnostic to our structural domain labels.
Histones agnostic to inside vs outside
The input file consists of rows, each beginning with a gene name and followed by tab-separated real numbers forming the columns. Each column corresponds to a histone modification.
The numbers are in z-score across all columns.
A lot of cells have missing values. I think this means “no reads” or just a few reads. Need to clarify this.
Citation required here: what is the origin of the file?
During visual inspection of the input file, we notice that a lot of the columns are too sparse.
Sparse histone modifications
Regardless of the way we will choose to handle these gaps, we would need to introduce some assumption and noise to our data. Se we decided to not take these histone modifications into consideration.
We have thus selected the following histone modifications to work with:
These still have occassional missing values.
Occassional missing values
Consider each gene as a vector of 9 dimensions. We patch the missing value using the following rule: * Find the 3 closest vectors (Euclidean distance) which happen to have the value populated. * Use the median of the 3.
The advantages of this method are:
Patched data:
We visualize the distance (as a heatmap), between each gene and every other gene.
All chromosomes yield a similar picture.
The black diagonal corresponds to a zero distance (all genes have a zero distance to themselves).
The first thing one notices in these pictures are the distinct red parallel lines (horizontal and vertical). A red line marks a gene with a particular histone profile, which is relatively unique (it differs from all other genes), thus the distance from all the other genes is significant.
The interesting thing though is that we get these unique profiles in clusters in the linear chromosome. This manifests as a black rectangle in the intersection of two thick red lines:
Community in distance map
We will call these local groups “Histone communities”, as they consist of genes with similar and unique histone modifications that converge together in space.
We have calculated the “community score” for each gene.
The score is defined as: A / N, where:
We bias the denominator to be 0.17 as a minimum. This corresponds to average of all N minus 3 standard deviations, so as to prevent divisions by zero. The 3 major spikes that can be seen in the picture below are such instances.
We consider a value of ‘2’ as significant. In other words, a gene being twice as similar as their neighbours than its similarity to all other genes in the chromosome is considered a “community center” gene. Here is a hierarchical clustering of the histone modifications of community centers:
We have chosen the community center genes in such a way that we expect clear-cut clusters to be formed in histone space. Plotting the within-cluster total sum of squares we expect to see a clear-cut elbow.
It is not easy to decide where the elbow is. I would say it is on 3 clusters, but there’s a lot more variance to be explained below that. Searching for a less ambiguous method, I stumbled upon x-means.
x-means clustering algorithm outputs not only the k clusters, but also k itself. It does so by optimizing an objective function - the Bayesian information criterion - which reveals the mosto likely k. Experiments by the authors have shown it to reveal the true model underlying the data.
bicOptimizedClusters <- Xmeans(centers, kmax=10)
Cluster <- bicOptimizedClusters$cluster
centers <- cbind(centers, Cluster)
Cluster
## [1] 6 7 6 1 7 2 3 4 4 4 4 6 7 7 7 6 2 2 3 6 7 4 6 5 6 4 7 6 4 4 4 7 7 5 5 5 5
## [38] 6 4 4 6 4 6 5 5 3 3 3 6 6 3 3 6 7 7 7 7 4 4 3 1 7 6 6 6 6 4 6 6 6 4 6 4 7
## [75] 7 7 7 5 5 7 7 6 4 6 7 7 7 2 6 2 7 7 5 2 4 4 4 7 7 7 6 4 7 6 6 6 7 6 2 2 2
## [112] 6 4 4 6 6 6 1 1 4 7 7 7 7 6 6 4 4 4 6 6 3 3 3 3 4 4 4 4 7 4 7 7 7 6 6 2 2
## [149] 2 6 7 6 6 3 3 7 4 7 7 5 7 7 2 7 6 6 7 7 6 4 7 7 6 3 3 6 6 1 6 4 4 7 6 4 7
## [186] 4 2 2 6 6 5 7 7 7 4 7 7 3 3 2 7 7 6
bicOptimizedClusters$size
## [1] 5 16 18 40 12 54 58
x-means comes up with k=7, which explains most of the variance. Beyond 7 clusters it is evudebt that there’s not much to be gained.
Saving the community centers to database in CommunityCenters table, containing Gene name and Community (integer).
# Insert community centers and corresponding cluster ids to db
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
dbExecute(conn, "DROP TABLE IF EXISTS CommunityCenters")
dbExecute(conn, "CREATE TABLE CommunityCenters(Gene TEXT PRIMARY KEY, Community INTEGER)")
for (i in 1:nrow(centers)) {
sql <- paste("INSERT INTO CommunityCenters(Gene, Community) VALUES('", row.names(centers)[i], "',", centers[i, 10], ")", sep='')
dbExecute(conn, sql)
}
dbDisconnect(conn)
In this plot we only see the community centers. Each dot represents a gene and implies a group of 5 genes around it, having a similar histone profile.
Subsequently, we are going to classify the entire genome using these centers as a scaffold, in histone space.
After classifying all genes using the above scaffold and knn algorithm (k=3), we can plot them in space.
Communities include a central histone-space cluster (aroung zero in all dimensions), which allows us to classify the entire set. In other words, one of the “extremes” is actually the center of histone space, distant from all other community centers.
Here is how the communities look in the linear chromosome. Genes are displayed in succession where each gene has a width of 1 pixel:
Communities on linear chromosome
We sample windows of 25 consecutive genes and measure Shannon entropy. Then we sample “clouds” of 25 uniformly random genes (with replacement) and compare the distributions of entropy.
Shannon entropy distributions for consecutive and cloud (random sets of) genes
We used a total of 100000 random samples for both windows and clouds.
Unlike the “cloud” distribution which is normal, the “window” distribution differs both in shape and in its displacement towards smaller values. Its multimodal shape hints a hierarchy of structure, starting from a set of very low-entropy windows (around 1.2 bits of entropy), two more less structured and more abundant sets at 1.5 and 1.7 bits and finally, after the average has been reached, the distribution collapses to normality.
Probably most interesting result is the first step of the staircase. This corresponds to a set of windows having around 1.2 bits of entropy. These account for around 1% of the samples.
The entire bubble lies in an area of low (<0.001) p-value with respect to the cloud distribution. An entropy limit of 1.28 bits corresponds to that p-value. We can filter our windows based on that.
Windows of 25 genes, filtered for a maximum of 1.28 bits of entropy.
Tight communities on linear chromosome
We now have sets of genes having both:
Does this imply proximity in 3D space?
To answer this, we measure the Euclidean distance in 3D space between “tight community” genes and between random pairs of genes. To remove the advantage of the first group, which is that they are consecutive by definition, we only sample pairs from distinct chromosomes.
Distance distributions of tight communities and random gene pairs. 28000 samples taken for each category.
The resulting p-value is approximately 0.57. Community genes do not appear to be correlated in 3D space. The multiple peaks are expected; each represents the interaction of a pair of communities.
Communities are named using a combination of the chromosome name and the optional suffix of -A or -B, which signifies the first or second cluster found in given chromosome. A given chromosome may contain zero, one or two clusters.
YBR203W YBR204C YBR205W YBR206W YBR207W YBR208C YBR209W YBR210W YBR211C YBR212W YBR213W YBR214W YBR215W YBR216C YBR217W YBR218C YBR219C YBR220C YBR221C YBR222C YBR223C YBR225W YBR226C YBR227C YBR228W YBR229C YBR230C YBR231C YBR232C YBR233W YBR234C YBR235W YBR236C YBR237W
YDL013W YDL012C YDL011C YDL010W YDL009C YDL008W YDL007W YDL006W YDL005C YDL004W YDL003W YDL002C YDL001W YDR001C YDR002W YDR003W YDR004W YDR005C YDR006C YDR007W YDR008C YDR009W YDR010C YDR011W YDR012W YDR013W YDR014W YDR015C YDR016C YDR017C YDR018C YDR019C
YDR078C YDR079W YDR080W YDR081C YDR082W YDR083W YDR084C YDR085C YDR086C YDR087C YDR088C YDR089W YDR090C YDR091C YDR092W YDR093W YDR094W YDR095C YDR096W YDR097C YDR098C YDR099W YDR100W YDR101C YDR102C YDR103W YDR104C YDR105C YDR106W
YGL110C YGL109W YGL108C YGL107C YGL106W YGL105W YGL104C YGL103W YGL102C YGL101W YGL100W YGL099W YGL098W YGL097W YGL096W YGL095C YGL094C YGL093W YGL092W YGL091C YGL090W YGL089C YGL088W YGL087C YGL086W YGL085W YGL084C YGL083W YGL082W YGL081W YGL080W YGL079W YGL078C YGL077C YGL076C YGL075C YGL074C YGL073W YGL072C YGL071W
YGL004C YGL003C YGL002W YGL001C YGR001C YGR002C YGR003W YGR004W YGR005C YGR006W YGR007W YGR008C YGR009C YGR010W YGR011W YGR012W YGR013W YGR014W YGR015C YGR016W YGR017W YGR018C YGR019W YGR020C YGR021W
YJR003C YJR004C YJR005W YJR006W YJR007W YJR008W YJR009C YJR010W YJR011C YJR012C YJR013W YJR014W YJR015W YJR016C YJR017C YJR018W YJR019C YJR020W YJR021C YJR022W YJR023C YJR024C YJR025C YJR026W YJR027W YJR029W YJR028W YJR030C YJR031C YJR032W YJR033C YJR034W YJR035W YJR036C YJR037W YJR038C YJR039W YJR040W YJR041C YJR042W YJR043C YJR044C YJR045C YJR046W YJR047C YJR048W YJR049C YJR050W YJR051W YJR052W YJR053W YJR054W YJR055W YJR056C YJR057W
YKL145W YKL144C YKL143W YKL142W YKL141W YKL140W YKL139W YKL138C YKL137W YKL136W YKL135C YKL134C YKL133C YKL132C YKL131W YKL130C YKL129C YKL128C YKL127W YKL126W YKL125W YKL124W YKL123W YKL122C YKL121W
YKR010C YKR011C YKR012C YKR013W YKR014C YKR015C YKR016W YKR017C YKR018C YKR019C YKR020W YKR021W YKR022C YKR023W YKR024C YKR025W YKR026C YKR027W YKR028W YKR029C YKR030W YKR031C YKR032W YKR033C YKR034W YKR035C YKR036C YKR037C YKR038C YKR039W YKR040C YKR041W
YML122C YML121W YML120C YML119W YML118W YML117W YML116W YML115C YML114C YML113W YML112W YML111W YML110C YML109W YML108W YML107C YML106W YML105C YML104C YML103C YML102W YML101C YML100W YML099C YML098W YML097C YML096W YML095C YML094W YML093W YML092C YML091C YML090W YML089C YML088W YML087C YML086C
YNL177C YNL176C YNL175C YNL174W YNL173C YNL172W YNL171C YNL170W YNL169C YNL168C YNL167C YNL166C YNL165W YNL164C YNL163C YNL162W YNL161W YNL160W YNL159C YNL158W YNL157W YNL156C YNL155W YNL154C YNL153C YNL152W YNL151C YNL150W YNL149C YNL148C YNL147W YNL146W YNL145W YNL144C YNL143C YNL142W YNL141W YNL140C YNL139C YNL138W YNL137C YNL136W YNL135C
YNL064C YNL063W YNL062C YNL061W YNL059C YNL058C YNL057W YNL056W YNL055C YNL054W YNL053W YNL052W YNL051W YNL050C YNL049C YNL048W YNL047C YNL046W YNL045W YNL044W YNL043C YNL042W YNL041C YNL040W YNL039W
YPR041W YPR042C YPR043W YPR044C YPR045C YPR046W YPR047W YPR048W YPR049C YPR050C YPR051W YPR052C YPR053C YPR054W YPR055W YPR056W YPR057W YPR058W YPR059C YPR060C YPR061C YPR062W YPR063C YPR064W YPR065W YPR066W YPR067W YPR068C YPR069C YPR070W
In order to establish citeria for enrichment p-values returned by g:profiler, we first input random sets of 30 genes and note the returned p-value (or ‘1’ if nothing was returned):
0.008 0.015 0.021 0.023 0.03 0.041 0.05 1 1 1 1 1 1 1 1 1 1 1 1 1
Our criteria for inclusion of g:profiler results become:
| Community | Term |
|---|---|
| II | acetyl-CoA transmembrane transporter activity |
| IV-A | Factor: Gal4p, Factor: Gal4p, Factor: SWI4P |
| IV-B | |
| VII-A | |
| VII-B | Factor: DAL81 |
| X | Factor: Mbp1p, Factor: ARG |
| XI-A | |
| XI-B | |
| XIII | |
| XIV-A | |
| XIV-B | |
| XVI |
We shall now describe a generic framework which was utilized throughout the rest of this work, to segment the three-dimensional genome into areas of interest regarding certain local properties of genes and chromatin in general. We will refer to this framework as a “sphere test”.
We sample sets of genes using randomly placed spheres of a given radius. A commonly used value for the radius is 15 units. This allows for sampling sets of up to approximately 150 genes. We then set a minimum acceptable size for a sample. Usually this is set to 50 genes. We do so to ensure that our calculations will be performed on a gene set of suitable size to do statistics upon.
Sampling procedure
We then calculate a given metric on each set of sampled genes. Properties like histone modification frequencies, presence or absence of transcription factor binding motifs, gene coexpression scores et cetera -collectively referred to as “signals”- are used as inputs to measure a statistical property of the local set of genes collected by each sphere.
We then proceed to calculate the same metric for random sets of genes, having the same gene count as the sphere-sample. We calculate a p-value as the ratio of random occurrences that are equal or more extreme to the metric produced by the sphere. Afterwards, we adjust the p-values by applying the Benjamini-Hochberg method.
We set a limit for the adjusted p-value, usually 1% and filter the sphere-samples below this limit. These sphere-samples are then considered to exhibit a significantly localized aspect of the signal.
The resulting sets of genes frequently overlap. We consider overlapping samples as belonging to the same local set of genes. A limit of a minimum 5% overlap is applied. Sets connected with less than 5% of their genes are considered distinct.
We use an iterative approach to clustering. It can be summarized as follows:
Hierarchical clustering
Sphere-test data flow
Our first sphere-test measures proximity in histone-space. Histone-space is a 9-dimensional space, where each dimension corresponds to one of our selected promoter-site histone modifications. A point in that space corresponds to a particular histone modification profile.
The metric for this test is the average distance in histone space, defined as follows.
Given a set of genes G, we denote as P the set of all possible unordered pairs of G. A histone distance is defined for a pair as the Euclidean distance in histone-space for the two genes of the pair. Then, the average histone distance in P is our metric for this particular sphere-test.
It follows that, a set of genes scoring low in this metric contains a more consistent histone modification profile than one scoring higher.
We employ a FDR of 1% to determine statistically significant samples, and this result in seven distinct (zero overlaps) local clusters of very coherent histone modification profiles. Total gene count is approximately 1000, which roughly represents 1/6th of the yeast genome.
sql <- "SELECT l.*, CASE WHEN Field IS NULL THEN 'None' ELSE Field END AS Field FROM Loci l LEFT JOIN PromoterFields f ON l.Gene = f.Gene"
fieldColors <- c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#00FFFF", "#4488AA", "#FF00FF", "#000000")
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
promoterFields <- dbGetQuery(conn, sql)
plot_ly(x=promoterFields$x, y=promoterFields$y, z=promoterFields$z, type="scatter3d", mode="markers", text=promoterFields$Gene, color=promoterFields$Field, colors=fieldColors)
dbDisconnect(conn)
Promoter fields pepetrate deep into the nucleus, thus not differentiating between “Internal” and “External” rough features:
sql <- "SELECT l.*, CASE WHEN Field IS NULL THEN 'None' ELSE Field END AS Field FROM Loci l LEFT JOIN PromoterFields f ON l.Gene = f.Gene WHERE l.Gene IN (SELECT Gene FROM RoughFeatures WHERE Label='Internal')"
fieldColors <- c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#00FFFF", "#4488AA", "#FF00FF", "#000000")
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
promoterFields <- dbGetQuery(conn, sql)
plot_ly(x=promoterFields$x, y=promoterFields$y, z=promoterFields$z, type="scatter3d", mode="markers", text=promoterFields$Gene, color=promoterFields$Field, colors=fieldColors)
dbDisconnect(conn)
We then attempt a pair plot of all 9 histone modifications and their associated classes.
# Disabling because this takes a long time to produce. Add ```{r} to enable again.
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
histoneProfiles <- dbGetQuery(conn, "SELECT H3K4, H3K36, H3NTerm, H4NTerm, H2A, H2B, H3, H4, H2AZ, Field FROM HistonesPromoterPatched h JOIN PromoterFields p ON h.Gene = p.Gene")
dbDisconnect(conn)
use_python('C:\\Users\\ProteinMetrics\\AppData\\Local\\Programs\\Python\\Python38\\python.exe')
sns <- import('seaborn')
plt <- import('matplotlib.pyplot')
pd <- import('pandas')
plot <- sns$pairplot(r_to_py(histoneProfiles), hue = 'Field')
plot$savefig("images/PromoterFieldsHistonePairPlot.png")
Histone modification profiles pair plot
Calling the histone profiles by absolute value is not possible. All distributions are almost identical to the overall histone modification distribution.
What differs, however, is a gentle displacement of the average in each dimension. The distinguishing characteristic of our clusters is the accumulation of such tendencies in multiple dimensions. This lowers the average histone profile distance to statistically significant levels.
The natural way to represent our data is therefore a quantification of the gentle shifts for each cluster.
For each histone modification, we calculate the average of each group. Then we consider the distribution of the averages and assign a z-score to each group. The 9-vector of z-scores will then characterize the collective tendencies of the group.
colors <- c("#FF0000", "#FF0000", "#000000", "#00FF00", "#00FF00")
colors <- colorRampPalette(colors)(25)
tendencies <- read.table(
"../../Results/PromoterFieldsTendencies.tsv",
sep="\t", header=TRUE, row.names=1)
tendenciesScaled <- scale(tendencies)
heatmap.2(tendenciesScaled, col=colors)
Observations:
The table below summarizes the more prominent tendencies for Over- and Under-representation observed in each cluster:
| Over | Under | |
|---|---|---|
| A | H3K36, H3, H4 | |
| B | H2A, H2B | H3K4, H3K36, H3NTerm |
| C | H3K4 | |
| D | (H3NTerm) | (H2A) |
| E | H4NTerm, H3, H2AZ | |
| F | H2AZ | H2A |
| G | H3NTerm, H4NTerm | H2B, H4 |
Tendencies of cluster D have been parenthesized, to show that these are not most dominant among the group. Rest of the table contains most dominant tendencies per histone variant.
What makes the clusters statistically significant is not a substantial shift in histone modification profiles. It is a subtle tendency towards over- or under-representation of each histone variant. Statistical significance is therefore explained by the entire gene group’s decision to move towards one or the other side. We didn’t select the groups for this. The statistical test was histone-space compactness. Thus, the groups must be of reduced variance. Let’s add a figure to illustrate that.
Only the larger clusters produce enriched terms:
| Term | PAdj | |
|---|---|---|
| D | Terpenoid backbone biosynthesis | 4% |
| E | Ribosomal small subunit binding | 5% |
| F | Factor: HSF; motif: AGAANAGAANAGAAN | 2% |
| G | Factor: HSF; motif: NTTCTAGAANAGAAN | 1% |
Same test structure, different metric.
We now measure average pairwise Jaccard distance in presence / absence of 102 transcription factor motifs found in a gene group.
sql <- "SELECT l.*, CASE WHEN Field IS NULL THEN 'None' ELSE Field END AS Field FROM Loci l LEFT JOIN MotifFields f ON l.Gene = f.Gene"
fieldColors <- c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#00FFFF", "#4488AA", "#FF00FF", "#000000")
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
motifFields <- dbGetQuery(conn, sql)
plot_ly(x=motifFields$x, y=motifFields$y, z=motifFields$z, type="scatter3d", mode="markers", text=motifFields$Gene, color=motifFields$Field, colors=fieldColors)
dbDisconnect(conn)
MotifFields:A overlaps 96.5% with Tethys.
Using Jaccard index instead of distance, we get more significant p-values. Second tail (nwar p-value 1) is still present but amaller). We can now afford to use an FDR of 1% (previously 5%).
sql <- "SELECT l.*, CASE WHEN Field IS NULL THEN 'None' ELSE Field END AS Field FROM Loci l LEFT JOIN JaccardIndexMotifFields f ON l.Gene = f.Gene"
fieldColors <- c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#00FFFF", "#4488AA", "#FF00FF", "#000000")
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
motifFields <- dbGetQuery(conn, sql)
plot_ly(x=motifFields$x, y=motifFields$y, z=motifFields$z, type="scatter3d", mode="markers", text=motifFields$Gene, color=motifFields$Field, colors=fieldColors)
dbDisconnect(conn)
Same test structure, different metric.
We now measure average pairwise coexpression score within a gene group.
sql <- "SELECT l.*, CASE WHEN Field IS NULL THEN 'None' ELSE Field END AS Field FROM Loci l LEFT JOIN CoexFields f ON l.Gene = f.Gene"
fieldColors <- c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#00FFFF", "#4488AA", "#FF00FF", "#000000")
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
coexFields <- dbGetQuery(conn, sql)
plot_ly(x=coexFields$x, y=coexFields$y, z=coexFields$z, type="scatter3d", mode="markers", text=coexFields$Gene, color=coexFields$Field, colors=fieldColors)
dbDisconnect(conn)
Using a python script to calculate minimum and maximum overlaps between all fields. MinimumOverlap = commonGeneCount / max(geneCount) MaximumOverlap = commonGeneCount / min(geneCount)
Keeping pairs with a maximum overlap greater than 0.5:
| Field A | Field B | Common genes | Min Overlap | Max Overlap |
|---|---|---|---|---|
| PromoterFields:A | CoexFields:G | 62 | 0.04 | 1.0 |
| PromoterFields:B | CoexFields:G | 116 | 0.08 | 0.97 |
| PromoterFields:C | CoexFields:F | 120 | 0.12 | 1.0 |
| PromoterFields:D | CoexFields:F | 160 | 0.16 | 1.0 |
| PromoterFields:F | CoexFields:G | 192 | 0.13 | 0.89 |
| PromoterFields:G | CoexFields:F | 218 | 0.22 | 0.96 |
| MotifFields:B | CoexFields:G | 82 | 0.05 | 0.85 |
We observe that only CoexFields have overlaps with other kinds of fields, and fields F and G in particular, which are the larger ones. Taking a closer look at the rendering, we see that coexpression fields occupy different hemispheres of the nucleus. The hemispheres are separated by a sea of “insulator”. This is nicely demonstrated by clusters A and B. They are different sectors of the Hat region, and they face away from each other and towards clusters F and G (respectively) of the bubble. These two sectors of the hat are separated by the same insulator that separates the rest of the nucleus.
We also get a distinct “south pole” made by the centromeres.
Let’s now combine these into a set of 4 clusters, using continent names as a consise indication of geometry:
CREATE TABLE Continents(Gene TEXT, Continent Text)
DROP TABLE IF EXISTS Continents;
CREATE TABLE Continents(Gene TEXT, Continent TEXT);
INSERT INTO Continents(Gene, Continent) SELECT Gene, 'Laurasia' FROM CoexFields WHERE Field = 'B' OR Field = 'C' OR Field = 'D' OR Field = 'G';
INSERT INTO Continents(Gene, Continent) SELECT Gene, 'Godwana' FROM CoexFields WHERE Field = 'A' OR Field = 'F';
INSERT INTO Continents(Gene, Continent) SELECT Gene, 'Antarctica' FROM CoexFields WHERE Field = 'E';
INSERT INTO Continents(Gene, Continent) SELECT Gene, 'Tethys' FROM Loci WHERE Gene NOT IN (SELECT Gene FROM Continents);
Here is how it looks:
sql <- "SELECT l.*, Continent FROM Loci l JOIN Continents c ON l.Gene = c.Gene"
fieldColors <- c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#00FFFF", "#4488AA", "#FF00FF", "#000000")
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
continents <- dbGetQuery(conn, sql)
plot_ly(x=continents$x, y=continents$y, z=continents$z, type="scatter3d", mode="markers", text=continents$Gene, color=continents$Continent, colors=fieldColors)
dbDisconnect(conn)
And here are the respective overlaps. We have more fields overlapping now:
| Field | Continent | Common genes | Min Overlap | Max Overlap |
|---|---|---|---|---|
| PromoterFields:A | Laurasia | 62 | 0.027 | 1.0 |
| PromoterFields:B | Laurasia | 116 | 0.051 | 0.97 |
| PromoterFields:F | Laurasia | 192 | 0.08 | 0.89 |
| MotifFields:B | Laurasia | 82 | 0.04 | 0.85 |
| —————- | ———– | ———— | ———– | ———– |
| PromoterFields:G | Godwana | 218 | 0.21 | 0.96 |
| PromoterFields:C | Godwana | 120 | 0.12 | 1.0 |
| PromoterFields:D | Godwana | 160 | 0.16 | 1.0 |
| —————- | ———– | ———— | ———– | ———– |
| PromoterFields:E | Tethys | 98 | 0.035 | 0.57 |
| MotifFields:A | Tethys | 55 | 0.02 | 0.96 |
| MotifFields:C | Tethys | 76 | 0.03 | 0.78 |
10000 samples of pairwise coexpression score for each continent:
continentCoexpression <- read.table(file = '../../Results/ContinentCoexpressionScoreSamples.tsv', sep = '\t', header = TRUE)
boxplot(continentCoexpression, outline=FALSE, las=2)
The important distinction is that of “sea versus land”. Any combination of the 3 land continents will yield the same (better than average) coexpression score:
continentCoexpression <- read.table(file = '../../Results/CombinedContinentCoexpressionScoreSamples.tsv', sep = '\t', header = TRUE)
boxplot(continentCoexpression, outline=FALSE, las=2)
| Compartment | p-value |
|---|---|
| Godwana | 1.18e-16 |
| Antarctica | 1.03e-25 |
| Laurasia | 7.02e-09 |
| Tethys | 1.15e-54 |
| Pangaea | 7.85e-14 |
t.test(continentCoexpression$Godwana, continentCoexpression$Random)
##
## Welch Two Sample t-test
##
## data: continentCoexpression$Godwana and continentCoexpression$Random
## t = 6.4204, df = 19998, p-value = 1.389e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.02866381 0.05385619
## sample estimates:
## mean of x mean of y
## 1.09641 1.05515
t.test(continentCoexpression$Antarctica, continentCoexpression$Random)
##
## Welch Two Sample t-test
##
## data: continentCoexpression$Antarctica and continentCoexpression$Random
## t = 7.4618, df = 19969, p-value = 8.886e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.03466125 0.05935875
## sample estimates:
## mean of x mean of y
## 1.10216 1.05515
t.test(continentCoexpression$Laurasia, continentCoexpression$Random)
##
## Welch Two Sample t-test
##
## data: continentCoexpression$Laurasia and continentCoexpression$Random
## t = 4.8569, df = 19997, p-value = 1.202e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01865038 0.04388962
## sample estimates:
## mean of x mean of y
## 1.08642 1.05515
t.test(continentCoexpression$Tethys, continentCoexpression$Random)
##
## Welch Two Sample t-test
##
## data: continentCoexpression$Tethys and continentCoexpression$Random
## t = -17.27, df = 19958, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1207367 -0.0961233
## sample estimates:
## mean of x mean of y
## 0.94672 1.05515
t.test(continentCoexpression$Pangaea, continentCoexpression$Random)
##
## Welch Two Sample t-test
##
## data: continentCoexpression$Pangaea and continentCoexpression$Random
## t = 5.8854, df = 19988, p-value = 4.033e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.02547789 0.05092211
## sample estimates:
## mean of x mean of y
## 1.09335 1.05515
t.test(continentCoexpression$Godwana, continentCoexpression$Pangaea)
##
## Welch Two Sample t-test
##
## data: continentCoexpression$Godwana and continentCoexpression$Pangaea
## t = 0.47086, df = 19990, p-value = 0.6377
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.009678111 0.015798111
## sample estimates:
## mean of x mean of y
## 1.09641 1.09335
t.test(continentCoexpression$Antarctica, continentCoexpression$Pangaea)
##
## Welch Two Sample t-test
##
## data: continentCoexpression$Antarctica and continentCoexpression$Pangaea
## t = 1.3822, df = 19926, p-value = 0.1669
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.003683477 0.021303477
## sample estimates:
## mean of x mean of y
## 1.10216 1.09335
t.test(continentCoexpression$Laurasia, continentCoexpression$Pangaea)
##
## Welch Two Sample t-test
##
## data: continentCoexpression$Laurasia and continentCoexpression$Pangaea
## t = -1.0644, df = 19993, p-value = 0.2872
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.019691278 0.005831278
## sample estimates:
## mean of x mean of y
## 1.08642 1.09335
Note that Pangaea is actually more tightly coexpressed than its subset “Laurasia”. Thus, Laurasian genes have a greater coexpression score with genes in Godwana and Antarctica, while the latter two are tighter than Pangaea.
“Tethys” (coexpression insulator) cluster has a 2% PAdj for “Metabolit pathways”.
These “Internal” gene sets have parts in all continents:
SELECT Chromosome, Continent, COUNT(Continent) FROM TightCommunities JOIN Continents
ON TightCommunities.Gene = Continents.Gene
JOIN Loci ON Loci.Gene = TightCommunities.Gene
GROUP BY Chromosome,Continent
| Chromosome | Continent | GeneCount |
|---|---|---|
| 2 | Godwana | 34 |
| 4 | Antarctica | 22 |
| 4 | Laurasia | 26 |
| 4 | Tethys | 13 |
| 7 | Laurasia | 21 |
| 7 | Tethys | 44 |
| 10 | Antarctica | 7 |
| 10 | Laurasia | 27 |
| 10 | Tethys | 21 |
| 11 | Antarctica | 22 |
| 11 | Laurasia | 25 |
| 11 | Tethys | 10 |
| 13 | Godwana | 24 |
| 13 | Tethys | 13 |
| 14 | Laurasia | 68 |
| 16 | Laurasia | 18 |
| 16 | Tethys | 12 |
| Continent | GeneCount | SSD | SsdRatio | SsdEnrich | WGD | WGDRatio | WGDEnrich |
|---|---|---|---|---|---|---|---|
| Antarctica | 417.000 | 65 | 0.156 | 1.012 | 71 | 0.170 | 1.015 |
| Godwana | 1026.000 | 159 | 0.155 | 1.006 | 204 | 0.199 | 1.185 |
| Laurasia | 2266.000 | 345 | 0.152 | 0.988 | 431 | 0.190 | 1.134 |
| Tethys | 2787.000 | 414 | 0.149 | 0.964 | 361 | 0.130 | 0.772 |
The following diagram summarizes the flow of data, beginning from our inputs (signals) and ending at results.
From sources to results
This image may serve as a guide for reproducing parts of this work. The associated SVG file can be found in r/Chapters/images/FromSourcesToResults.svg and conveniently zoomed at. Also the source graph project is available in Diagrams folder (FromSourcesToResults.graphml).
The light gray frame in the picture above represents the database. We use a single sqlite file as a central data structure. Various scripts and programs, independent from one another, perform the required transformations. This structure makes it easy to add components at will, with the sole requirement that each component respects the central data structure. This provides for a much cleaner namespace. For instance, all of our programs that do sphere tests, define a “Gene” class. Each time the class has different members (histones, TFs etc) and is loaded from different combinations of tables. This lets each particular program to solely focus on the task at hand and avoids complicated data structures and naming schemes. At the same time, all available data is accessible by all programs.
Let’s see how this new dataset looks:
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
replicationTiming <- dbGetQuery(conn, "SELECT l.Gene, x, y, z, ReplicationTiming FROM Loci l JOIN ReplicationTiming r ON l.Gene = r.Gene")
plot_ly(x=replicationTiming$x, y=replicationTiming$y, z=replicationTiming$z, type="scatter3d", mode="markers", text=replicationTiming$Gene, color=replicationTiming$ReplicationTiming)
dbDisconnect(conn)
We use standard deviation of replication timing in a sphere as a metric.
This signal is so smooth that almost all possible spheres have a far below metric than the randomized set of same size.
Maybe it makes more sense to find the rare spheres where smoothness is disrupted:
sql <- "SELECT l.*, CASE WHEN Field IS NULL THEN 'None' ELSE Field END AS Field FROM Loci l LEFT JOIN ReplicationTimingFields f ON l.Gene = f.Gene"
fieldColors <- c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#00FFFF", "#4488AA", "#FF00FF", "#000000")
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
replicationTimingFields <- dbGetQuery(conn, sql)
plot_ly(x=replicationTimingFields$x, y=replicationTimingFields$y, z=replicationTimingFields$z, type="scatter3d", mode="markers", text=replicationTimingFields$Gene, color=replicationTimingFields$Field, colors=fieldColors)
dbDisconnect(conn)
Field C is the same as MotifFields A, which was again the weird one, having more variance in TFs than expected by chance. These 123 genes are all located in chrXII, from ~300Kb to ~800KB, with a 200Kb gap between ~500-700.
Maybe we have the wrong (index space) coordinates for these genes? That would explain having significantly varied signals for both signal types.
Alternatively, this weird behavior is because of the rDNA that is present there.
sql <- "SELECT Gene, Chromosome, Start,End FROM Loci WHERE Gene IN (SELECT Gene FROM ReplicationTimingFields WHERE Field = 'C') ORDER BY Start"
conn <- dbConnect(RSQLite::SQLite(), "../../Results/yeast.sqlite")
weirdGenes <- dbGetQuery(conn, sql)
dbDisconnect(conn)
weirdGenes
## Gene Chromosome Start End
## 1 YLR091W 12 322297 323179
## 2 YLR092W 12 323544 326226
## 3 YLR093C 12 326513 327416
## 4 YLR094C 12 327730 329239
## 5 YLR095C 12 329677 332116
## 6 YLR096W 12 332590 336034
## 7 YLR097C 12 336231 337266
## 8 YLR098C 12 337527 339474
## 9 YLR099C 12 339744 340929
## 10 YLR099W 12 341325 341589
## 11 YLR100W 12 341810 342854
## 12 YLR101C 12 342566 342962
## 13 YLR102C 12 342970 343768
## 14 YLR103C 12 343989 345942
## 15 YLR107W 12 364116 365331
## 16 snR6 12 366235 366347
## 17 YLR108C 12 366667 368125
## 18 YLR109W 12 368781 369312
## 19 YLR110C 12 369697 370099
## 20 YLR111W 12 370391 370724
## 21 YLR112W 12 370791 371211
## 22 YLR113W 12 371620 372928
## 23 YLR114C 12 374944 377239
## 24 YLR115W 12 377985 380565
## 25 YLR116W 12 380822 382253
## 26 YLR117C 12 382471 384535
## 27 YLR118C 12 384725 385409
## 28 YLR119W 12 385534 386176
## 29 YLR120C 12 386511 388221
## 30 YLR120W 12 388672 388774
## 31 YLR121C 12 388744 390271
## 32 YLR122C 12 390954 391332
## 33 YLR123C 12 391078 391408
## 34 YLR124W 12 391600 391945
## 35 YLR125W 12 393484 393895
## 36 YLR126C 12 394765 395521
## 37 YLR127C 12 395758 398320
## 38 YLR128W 12 398530 399434
## 39 YLR129W 12 399657 402489
## 40 YLR130C 12 402794 404063
## 41 YLR131C 12 404510 406823
## 42 YLR132C 12 407283 408156
## 43 YLR133W 12 408445 410194
## 44 YLR134W 12 410723 412415
## 45 YLR135W 12 413281 415528
## 46 YLR136C 12 415801 416659
## 47 YLR137W 12 417006 418110
## 48 YLR138W 12 418437 421395
## 49 YLR139C 12 421542 423474
## 50 YLR140W 12 423474 423801
## 51 YLR141W 12 423683 424775
## 52 YLR142W 12 425186 426617
## 53 YLR143W 12 427329 429387
## 54 YLR144C 12 429677 432017
## 55 YLR145W 12 432168 432774
## 56 YLR146C 12 432823 433726
## 57 YLR146W 12 433870 434059
## 58 YLR147C 12 434158 434464
## 59 YLR148W 12 434641 437398
## 60 YLR149C 12 437631 439824
## 61 YLR150W 12 440467 441289
## 62 YLR151C 12 441715 442738
## 63 YLR152C 12 442958 444689
## 64 YLR153C 12 445524 447576
## 65 YLR154C 12 447982 448315
## 66 RDN25 12 451786 455182
## 67 RDN37 12 451786 457733
## 68 YLR154W 12 452438 452624
## 69 RDN58 12 455414 455572
## 70 RDN18 12 455933 457733
## 71 RDN5 12 459676 459797
## 72 YLR155C 12 469317 470406
## 73 YLR156W 12 472113 472458
## 74 YLR156C 12 472479 472611
## 75 YLR157C 12 472969 474058
## 76 YLR157W 12 475764 475977
## 77 YLR158C 12 482549 483638
## 78 YLR159W 12 485345 485690
## 79 YLR159C 12 485711 485843
## 80 YLR160C 12 486201 487290
## 81 YLR161W 12 488997 489342
## 82 YLR162W 12 489573 489930
## 83 YLR163C 12 491867 493256
## 84 YLR163W 12 492813 492927
## 85 YLR164W 12 493884 494391
## 86 YLR165C 12 494495 495260
## 87 YLR166C 12 495430 498046
## 88 YLR167W 12 498948 499407
## 89 YLR168C 12 499579 500272
## 90 YLR169W 12 500336 500690
## 91 YLR170C 12 500580 501051
## 92 YLR171W 12 500734 501124
## 93 YLR172C 12 501261 502164
## 94 YLR173W 12 502422 504249
## 95 YLR174W 12 504592 505831
## 96 YLR175W 12 506135 507587
## 97 YLR176C 12 507798 510234
## 98 YLR177W 12 511055 512942
## 99 YLR178C 12 513163 513823
## 100 YLR179C 12 514109 514715
## 101 YLR180W 12 515263 516412
## 102 YLR181C 12 516679 517672
## 103 YLR182W 12 517941 520353
## 104 YLR183C 12 520544 522014
## 105 YLR184W 12 522107 522455
## 106 YLR185W 12 522664 523290
## 107 YLR186W 12 523633 524392
## 108 YLR310C 12 752225 756995
## 109 YLR311C 12 757266 757614
## 110 YLR312C 12 757638 758835
## 111 YLR312W 12 759481 760243
## 112 YLR313C 12 760356 762342
## 113 YLR314C 12 762574 764137
## 114 YLR315W 12 764807 765269
## 115 YLR316C 12 765265 766358
## 116 YLR317W 12 765654 766089
## 117 YLR324W 12 779214 780786
## 118 YLR325C 12 781142 781379
## 119 YLR326W 12 782173 782896
## 120 YLR327C 12 783126 783387
## 121 YLR328W 12 784912 786118
## 122 YLR329W 12 786441 787333
## 123 YLR330W 12 787663 789679